Importing the necessary libraries
library(dplyr)
library(tidyverse)
library(ggplot2)
Reading the data
data <- read.csv("/home/riya/BRN/gapminder_clean.csv") %>%
  as_tibble()
head(data)
## # A tibble: 6 × 20
## X Country.Name Year Agriculture..value.added....…¹ CO2.emissions..metri…²
## <int> <chr> <int> <dbl> <dbl>
## 1 0 Afghanistan 1962 NA 0.0738
## 2 1 Afghanistan 1967 NA 0.124
## 3 2 Afghanistan 1972 NA 0.131
## 4 3 Afghanistan 1977 NA 0.183
## 5 4 Afghanistan 1982 NA 0.166
## 6 5 Afghanistan 1987 NA 0.276
## # ℹ abbreviated names: ¹Agriculture..value.added....of.GDP.,
## # ²CO2.emissions..metric.tons.per.capita.
## # ℹ 15 more variables:
## # Domestic.credit.provided.by.financial.sector....of.GDP. <dbl>,
## # Electric.power.consumption..kWh.per.capita. <dbl>,
## # Energy.use..kg.of.oil.equivalent.per.capita. <dbl>,
## # Exports.of.goods.and.services....of.GDP. <dbl>, …
1. Filter the data to include only rows where Year is 1962, and
a) make a scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap for the filtered data;
b) calculate the correlation of ‘CO2 emissions (metric tons per capita)’ and gdpPercap. What is the correlation and the associated p-value?
# filter the data to include only rows where Year is equal to 1962
filtered_data1 <- data %>%
  filter(Year == 1962)
filtered_data1 %>%
  ggplot(aes(x = CO2.emissions..metric.tons.per.capita., y = gdpPercap)) +
  geom_point()
# ggsave("scatterPlot.png", path = "/home/riya/BRN/Plots")
cor_res <- cor.test(filtered_data1$CO2.emissions..metric.tons.per.capita.,
                    filtered_data1$gdpPercap)
cor_res
##
## Pearson's product-moment correlation
##
## data: filtered_data1$CO2.emissions..metric.tons.per.capita. and filtered_data1$gdpPercap
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8934697 0.9489792
## sample estimates:
## cor
## 0.9260817
cor_res$p.value
## [1] 1.128679e-46
The correlation coefficient of approximately 0.926 indicates a very
strong positive linear relationship between ‘CO2 emissions
(metric tons per capita)’ and gdpPercap in 1962. The associated p-value
is about 1.13e-46, so the correlation is statistically significant, and
the 95% confidence interval (0.893 to 0.949) lies well above zero.
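The figures quoted above can be read directly from the object that cor.test() returns, instead of being copied from the printed output. The sketch below uses the built-in mtcars dataset purely as a stand-in for the gapminder columns; the variable names r, p and ci are illustrative and do not appear in the original analysis.

```r
# mtcars is a built-in dataset, used here only as a stand-in for the real data
res <- cor.test(mtcars$wt, mtcars$mpg)

r  <- unname(res$estimate)   # Pearson correlation coefficient
p  <- res$p.value            # two-sided p-value
ci <- res$conf.int           # 95% confidence interval (lower, upper)

# Report the three numbers together, rounded for readability
cat("r =", round(r, 3),
    "| p =", signif(p, 3),
    "| 95% CI =", round(ci[1], 3), "to", round(ci[2], 3), "\n")
```

Applied to cor_res, the same three components give the estimate, p-value and interval reported for 1962.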
2. On the unfiltered data, answer “In what year is the correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap the strongest?” Filter the dataset to that year for the next step…
data %>%
  filter(complete.cases(CO2.emissions..metric.tons.per.capita., gdpPercap)) %>%
  group_by(Year) %>%
  summarise(cor = cor(CO2.emissions..metric.tons.per.capita., gdpPercap)) %>%
  slice(which.max(cor))
## # A tibble: 1 × 2
## Year cor
## <int> <dbl>
## 1 1967 0.939
filtered_data2 <- data %>%
  filter(Year == 1967)
head(filtered_data2)
## # A tibble: 6 × 20
## X Country.Name Year Agriculture..value.added..…¹ CO2.emissions..metri…²
## <int> <chr> <int> <dbl> <dbl>
## 1 1 Afghanistan 1967 NA 0.124
## 2 11 Albania 1967 NA 1.36
## 3 21 Algeria 1967 10.3 0.632
## 4 31 American Samoa 1967 NA NA
## 5 41 Andorra 1967 NA NA
## 6 51 Angola 1967 NA 0.167
## # ℹ abbreviated names: ¹Agriculture..value.added....of.GDP.,
## # ²CO2.emissions..metric.tons.per.capita.
## # ℹ 15 more variables:
## # Domestic.credit.provided.by.financial.sector....of.GDP. <dbl>,
## # Electric.power.consumption..kWh.per.capita. <dbl>,
## # Energy.use..kg.of.oil.equivalent.per.capita. <dbl>,
## # Exports.of.goods.and.services....of.GDP. <dbl>, …
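The which.max(cor) step above picks the largest signed correlation. That matches "strongest" only if no year has a strong negative correlation. A minimal sketch of ranking by absolute value instead is shown below. The toy tibble and its column names (co2, gdp) are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: three years with five observations each
toy <- tibble(
  Year = rep(c(1962L, 1967L, 1972L), each = 5),
  co2  = c(1, 2, 3, 4, 5,   1, 2, 3, 4, 5,   1, 2, 3, 4, 5),
  gdp  = c(2, 1, 4, 3, 6,   5, 4, 3, 2, 1,   1, 3, 2, 5, 4)
)

strongest <- toy %>%
  filter(complete.cases(co2, gdp)) %>%   # drop rows with a missing value
  group_by(Year) %>%
  summarise(cor = cor(co2, gdp)) %>%     # one Pearson correlation per year
  slice_max(abs(cor), n = 1)             # rank by strength, ignoring sign

strongest
```

In this toy data the 1967 values are perfectly negatively correlated, so ranking by abs(cor) selects 1967, whereas which.max(cor) would not.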
3. Using plotly, create an interactive scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent.
library(plotly)
p <- filtered_data2 %>%
  ggplot(aes(x = CO2.emissions..metric.tons.per.capita., y = gdpPercap, color = continent)) +
  # map population to point size; a bare aes(pop) is read positionally as the x aesthetic
  geom_point(aes(size = pop))
ggplotly(p)
# ggsave("emVsgdp_scatterplot.png", plot = p, path = "/home/riya/BRN/Plots")
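The same kind of chart can also be built with plotly's own plot_ly() interface rather than by converting a ggplot object. The sketch below is one possible version, not the method used in this report. The toy data frame, its values and the short column names co2 and country are invented for illustration.

```r
library(plotly)

# Hypothetical stand-in for filtered_data2 (all values invented)
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("Asia", "Asia", "Europe", "Africa"),
  co2       = c(0.5, 2.1, 6.3, 0.2),
  gdpPercap = c(800, 3500, 12000, 600),
  pop       = c(5e7, 1e7, 8e6, 2e7)
)

# Formulas (~column) tell plot_ly which columns drive each visual property
fig <- plot_ly(
  toy,
  x = ~co2, y = ~gdpPercap,
  size = ~pop,          # point size from population
  color = ~continent,   # point color from continent
  text = ~country,      # shown in the hover label
  type = "scatter", mode = "markers"
)
```

Printing fig in an interactive session or a knitted HTML document renders the interactive plot.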
4. What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’?
# plotting a boxplot to visualise the relationship between these variables
data %>%
  ggplot(aes(x = continent, y = Energy.use..kg.of.oil.equivalent.per.capita.)) +
  geom_boxplot()
# ggsave("boxplot.png", path = "/home/riya/BRN/Plots")
In the plot above, there seem to be some differences in energy use across continents, particularly among Asia, Europe and Oceania (the highest median is observed for Oceania). We will test whether these differences are statistically significant using a one-way ANOVA.
aov_model <- aov(data$Energy.use..kg.of.oil.equivalent.per.capita. ~ data$continent)
summary(aov_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## data$continent 5 8.124e+08 162482656 21.88 <2e-16 ***
## Residuals 1404 1.043e+10 7426183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1197 observations deleted due to missingness
Here, the observed p-value is very small (<2e-16) and provides
strong evidence to reject the null hypothesis that mean energy use is
the same in every continent. This indicates that mean energy use differs
significantly between at least two continents.
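The ANOVA shows only that some difference exists, not which continents differ. A common follow-up is Tukey's honest significant difference test, sketched below on the built-in PlantGrowth dataset as a stand-in. It was not run on the gapminder data in this report, so no result for the continents is implied.

```r
# PlantGrowth is a built-in dataset (weight by treatment group), used as a stand-in
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)

# Tukey's HSD compares every pair of groups and adjusts the p-values
# for multiple comparisons
tukey <- TukeyHSD(fit)
tukey$group   # columns: diff, lwr, upr, p adj (one row per pair)
```

Passing the data frame through the data argument, rather than writing data$column inside the formula, also keeps the term labels in the output short.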
5. Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990?
# density plot to visualise the differences in imports of goods and services between the two continents
data %>%
  filter(Year > 1990 & continent %in% c('Europe', 'Asia')) %>%
  ggplot(aes(x = Imports.of.goods.and.services....of.GDP., fill = continent)) +
  geom_density(alpha = 0.3) +
  labs(title = "Imports of goods and services between Europe and Asia")
# stats
my_Data <- data %>%
  filter(Year > 1990) %>%
  select(continent, Imports.of.goods.and.services....of.GDP.) %>%
  filter(continent %in% c('Europe', 'Asia'))
t.test(Imports.of.goods.and.services....of.GDP. ~ continent, data = my_Data)
##
## Welch Two Sample t-test
##
## data: Imports.of.goods.and.services....of.GDP. by continent
## t = 1.3552, df = 137.53, p-value = 0.1776
## alternative hypothesis: true difference in means between group Asia and group Europe is not equal to 0
## 95 percent confidence interval:
## -2.321099 12.433240
## sample estimates:
## mean in group Asia mean in group Europe
## 46.84531 41.78924
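Based on the results, the p-value of 0.1776 is greater than the
typical significance level of 0.05, and the 95% confidence interval for
the difference in means (-2.32 to 12.43) includes zero. We therefore fail
to reject the null hypothesis: there is no statistically significant
difference in mean imports of goods and services (% of GDP) between Asia
and Europe in the years after 1990.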
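The fractional degrees of freedom in the output above (df = 137.53) arise because t.test() runs Welch's test by default, which does not assume equal variances in the two groups. The sketch below reproduces the Welch statistic and degrees of freedom by hand on the built-in sleep dataset, used here only as a stand-in. The names t_manual and df_manual are illustrative.

```r
# 'sleep' is a built-in dataset with two groups of 10 observations
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]

# Squared standard error of each group mean (variances are not pooled)
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch statistic: difference in means over the unpooled standard error
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

res <- t.test(x, y)   # var.equal = FALSE by default, i.e. Welch's test
c(t = t_manual, df = df_manual)
```

Both hand-computed values agree with res$statistic and res$parameter.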
6. What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years?
data %>%
  group_by(Country.Name) %>%
  summarise(mean = mean(Population.density..people.per.sq..km.of.land.area.)) %>%
  slice(which.max(mean))
## # A tibble: 1 × 2
## Country.Name mean
## <chr> <dbl>
## 1 Macao SAR, China 14732.
Macao SAR, China has the highest ‘Population density (people per sq. km
of land area)’ when averaged across all years, at about 14,732 people per sq. km.
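One caveat about the summary above: mean() is called without na.rm = TRUE, so any country with a missing density in at least one year gets a mean of NA, and which.max() then skips it. The sketch below shows the effect on invented toy data. Whether it changes the answer for the real dataset has not been checked here.

```r
library(dplyr)

# Hypothetical toy data: country B has one missing year
toy <- tibble(
  Country.Name = rep(c("A", "B", "C"), each = 3),
  density      = c(10, 12, 14,   NA, 900, 1000,   100, 110, 120)
)

# Without na.rm, B's mean is NA and B silently drops out of the ranking
without_narm <- toy %>%
  group_by(Country.Name) %>%
  summarise(mean = mean(density)) %>%
  slice(which.max(mean))

# With na.rm = TRUE, B is averaged over the years that were observed
with_narm <- toy %>%
  group_by(Country.Name) %>%
  summarise(mean = mean(density, na.rm = TRUE)) %>%
  slice(which.max(mean))

without_narm$Country.Name   # "C"
with_narm$Country.Name      # "B"
```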
7. What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ between 1962 and 2007?
data %>%
  filter(Year %in% c(1962, 2007)) %>%
  select(Year, Country.Name, Life.expectancy.at.birth..total..years.) %>%
  group_by(Country.Name) %>%
  pivot_wider(names_from = Year, values_from = Life.expectancy.at.birth..total..years.) %>%
  mutate(diff_LE = `2007` - `1962`) %>%
  arrange(desc(diff_LE))
## # A tibble: 263 × 4
## # Groups: Country.Name [263]
## Country.Name `1962` `2007` diff_LE
## <chr> <dbl> <dbl> <dbl>
## 1 Maldives 38.5 75.4 36.9
## 2 Bhutan 33.1 66.3 33.2
## 3 Timor-Leste 34.7 65.8 31.1
## 4 Tunisia 43.3 74.2 30.9
## 5 Oman 44.3 75.1 30.8
## 6 Nepal 36.0 66.6 30.6
## 7 China 44.4 74.3 29.9
## 8 Yemen, Rep. 34.7 62.0 27.2
## 9 Saudi Arabia 46.7 73.3 26.7
## 10 Iran, Islamic Rep. 46.1 72.7 26.6
## # ℹ 253 more rows
The Maldives saw the greatest increase in life expectancy at birth between 1962 and 2007: 36.9 years, from 38.5 to 75.4.
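Because the question asks for the "country (or countries)", it can help to return every country tied for the largest gain rather than reading the top row of a sorted table. A minimal sketch on invented toy data follows. The column name life_exp is illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: countries A and B tie for the largest gain
toy <- tibble(
  Country.Name = rep(c("A", "B", "C"), each = 2),
  Year         = rep(c(1962L, 2007L), times = 3),
  life_exp     = c(40, 70,   50, 80,   60, 65)
)

gains <- toy %>%
  pivot_wider(names_from = Year, values_from = life_exp) %>%  # one column per year
  mutate(diff_LE = `2007` - `1962`) %>%                       # gain over the period
  slice_max(diff_LE, n = 1, with_ties = TRUE)                 # keep all tied leaders

gains
```

Here both A and B gain 30 years, so gains contains two rows.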